In [1]:
import warnings

# Ignore future warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
In [2]:
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
In [15]:
# First, let's load the dataset and examine its structure to proceed with the analysis.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import matplotlib.pyplot as plt
import seaborn as sns
import calendar
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor, AdaBoostRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import svm
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import numpy as np
In [4]:
# Load the dataset
file_path = '/content/Final Data.csv'
data = pd.read_csv(file_path)

# Display the first few rows of the dataset to understand its structure and variables
data.head()
Out[4]:
Date Cabbage Capsicum Cauliflower Chicken Cucumber eddoe Egg Egg_Plant Garlic ... Pakistani_Banana Potato Pumpkin Tomato Zucchini Economic_Policy_Uncertainty_Index Temperature Rain Petrol_Prices USD_PKR
0 01/01/2017 30 64 26 112 48 60 115 28 296 ... 75 27 44 26 27 45.36 20 0 66.51 104.375
1 02/01/2017 36 67 23 112 44 70 116 40 330 ... 80 26 50 28 26 45.36 19 0 66.51 104.750
2 03/01/2017 35 65 24 112 41 74 118 40 330 ... 88 27 60 30 27 45.36 17 1 66.51 104.785
3 04/01/2017 35 68 25 115 44 70 118 52 330 ... 90 28 65 32 25 45.36 15 1 66.51 104.810
4 05/01/2017 33 66 23 97 42 72 121 40 335 ... 85 27 50 29 25 45.36 17 0 66.51 104.700

5 rows × 24 columns

The dataset includes a variety of variables, such as prices for different vegetables (e.g., Cabbage, Capsicum, Cauliflower), Chicken, economic indicators (Economic Policy Uncertainty Index, Petrol Prices, USD to PKR exchange rate), and environmental factors (Temperature, Rain). "Potato" is identified as the dependent variable, which means the goal is to predict its prices based on other features.

In [ ]:
# Clean the data by converting non-numeric entries to NaN
commodities = [
    'Cabbage', 'Capsicum', 'Cauliflower', 'Chicken', 'Cucumber', 'eddoe',
    'Egg', 'Egg_Plant', 'Garlic', 'Ginger', 'Green Chilli', 'Laddy_Finger',
    'Onion', 'Pakistani_Banana', 'Potato', 'Pumpkin', 'Tomato', 'Zucchini'
]

for commodity in commodities:
    data[commodity] = pd.to_numeric(data[commodity], errors='coerce')

# Continue with our previous processing
data['Date'] = pd.to_datetime(data['Date'])
data['Month'] = data['Date'].dt.month_name()
data['Year'] = data['Date'].dt.year

# Now, we will create a table that contains the average price for each commodity by month and year
table = pd.DataFrame()

# For each commodity, we calculate the average price by month
for commodity in commodities:
    # Group by year and month, and calculate the average
    monthly_avg = data.groupby(['Year', 'Month'])[commodity].mean().reset_index()
    # Merge into the table DataFrame
    if table.empty:
        table = monthly_avg
    else:
        table = table.merge(monthly_avg, on=['Year', 'Month'], how='outer')

# Pivot the table to have months as rows and years as columns for each commodity
pivot_table = table.pivot('Month', 'Year')

# Sort the pivot_table index based on calendar month order
months_order = ['January', 'February', 'March', 'April', 'May', 'June',
                'July', 'August', 'September', 'October', 'November', 'December']
pivot_table = pivot_table.reindex(months_order)

print(pivot_table)  # Display the first few rows of the pivoted table
pivot_table.to_excel('commodity_prices.xlsx')
<ipython-input-5-b0250fb632da>:12: UserWarning: Parsing dates in DD/MM/YYYY format when dayfirst=False (the default) was specified. This may lead to inconsistently parsed dates! Specify a format to ensure consistent parsing.
  data['Date'] = pd.to_datetime(data['Date'])
             Cabbage                                                          \
Year            2017       2018       2019       2020       2021        2022   
Month                                                                          
January    30.580645  34.683333  35.901639  23.419355  28.032258   58.161290   
February   28.285714  35.555556  27.481481  24.724138  27.178571   76.714286   
March      27.064516  32.766667  28.200000  22.806452  24.225806   65.903226   
April      25.966667  39.181818  48.413793  24.266667  32.600000   68.200000   
May        23.903226  33.666667  53.700000  28.354839  29.774194   77.387097   
June       29.733333  49.310345  51.827586  38.033333  38.366667  125.033333   
July       24.129032  55.933333  44.000000  45.516129  51.580645   93.967742   
August     39.161290  63.066667  37.466667  42.354839  56.806452   85.032258   
September  34.000000  45.689655  45.344828  35.300000  48.266667   74.700000   
October    39.580645  42.000000  41.433333  33.806452  43.677419   99.677419   
November   40.933333  34.827586  34.137931  37.933333  36.766667   82.666667   
December   41.096774  30.300000  28.933333  38.516129  37.064516   56.516129   

             Capsicum                                      ...      Tomato  \
Year             2017        2018        2019        2020  ...        2019   
Month                                                      ...               
January     79.193548   64.850000   79.803279   60.193548  ...   48.918033   
February    87.250000  111.629630  102.074074   71.793103  ...  106.259259   
March       68.645161   64.166667   95.366667   62.935484  ...   79.300000   
April       58.866667   92.000000   69.724138   50.033333  ...   68.482759   
May         59.516129   65.900000   56.200000   59.903226  ...   49.066667   
June        61.366667   60.586207   77.689655   68.466667  ...   52.068966   
July        71.741935   90.566667   90.636364   74.483871  ...   70.636364   
August      85.354839  110.166667   86.233333   84.548387  ...   55.266667   
September   86.300000  114.962963  115.851852   89.566667  ...   65.035714   
October    101.096774   98.400000   97.333333  111.612903  ...   64.833333   
November   111.000000   83.758621   82.965517   96.766667  ...   49.551724   
December    99.838710   87.466667   85.866667   94.677419  ...   48.700000   

                                              Zucchini                        \
Year            2020       2021        2022       2017       2018       2019   
Month                                                                          
January    44.129032  56.129032   92.709677  36.387097  43.150000  50.262295   
February   46.000000  40.000000  124.964286  31.357143  46.333333  48.185185   
March      44.354839  54.580645  100.935484  31.903226  38.466667  51.233333   
April      40.300000  47.100000  101.700000  33.133333  54.181818  48.928571   
May        33.161290  42.225806  100.548387  28.838710  37.633333  29.133333   
June       43.166667  45.000000   97.600000  34.900000  45.586207  40.586207   
July       63.612903  69.612903   89.096774  49.677419  70.800000  49.727273   
August     47.193548  62.225806  125.838710  54.741935  58.566667  47.066667   
September  43.400000  54.033333  131.066667  54.566667  68.000000  66.103448   
October    46.258065  74.870968  165.903226  41.903226  71.300000  67.833333   
November   53.600000  99.533333  157.200000  38.033333  42.310345  40.068966   
December   55.483871  67.064516   86.419355  52.258065  40.933333  38.233333   

                                             
Year            2020       2021        2022  
Month                                        
January    35.870968  44.645161   58.193548  
February   34.000000  40.571429   60.071429  
March      33.580645  41.870968   53.064516  
April      30.000000  35.066667   51.133333  
May        33.645161  47.000000   50.967742  
June       48.766667  67.933333   71.600000  
July       64.032258  83.161290   86.677419  
August     50.548387  93.838710  104.258065  
September  49.233333  91.733333  114.433333  
October    52.935484  76.548387   68.677419  
November   41.133333  50.333333   57.400000  
December   40.806452  48.709677   61.290323  

[12 rows x 108 columns]
In [ ]:
# Seasonal Trends in Potato Prices
plt.figure(figsize=(12, 6))
data['Month'] = pd.to_datetime(data['Date']).dt.month_name()
sns.lineplot(x='Month', y='Potato', data=data, estimator='mean')
plt.xticks(rotation=45)
plt.title('Seasonal Trends in Potato Prices')
plt.show()



# Seasonal Trends in Potato Prices
plt.figure(figsize=(12, 6))
data['Month'] = pd.to_datetime(data['Date']).dt.month_name()
sns.lineplot(x='Month', y='Potato', data=data, estimator='mean')
plt.xticks(rotation=45)
plt.title('Seasonal Trends in Potato Prices')
plt.show()
In [ ]:
# Our DataFrame is named 'data' and we have 'Potato' prices along with 'Temperature' and 'Rain'
plt.figure(figsize=(12, 6))

# Make sure the column names match exactly with those in we DataFrame
# If 'Temperature' is not the correct column name, replace it with the correct one from our dataset
sns.scatterplot(x=data[' Temperature'], y=data['Potato'], color='blue', label='Temperature')

# Similarly, make sure 'Rain' is the correct column name for rainfall data
sns.scatterplot(x=data['Rain'], y=data['Potato'], color='red', label='Rain')

plt.title('Impact of Weather on Potato Prices')
plt.xlabel('Weather Variables')
plt.ylabel('Potato Price')
plt.legend()
plt.show()
In [ ]:
# Inter-Commodity Price Relationship
# This would typically be a pairplot, but for simplicity, we'll do a correlation heatmap
commodities = ['Potato', 'Onion', 'Chicken', 'Egg']  #
plt.figure(figsize=(10, 8))
sns.heatmap(data[commodities].corr(), annot=True, fmt=".2f", cmap='coolwarm')
plt.title('Inter-Commodity Price Relationship')
plt.show()

1. Time Series Plot for Potato Prices Over Time¶

In [ ]:
data['Date'] = pd.to_datetime(data['Date'])
data['Month'] = data['Date'].dt.month
monthly_avg = data.groupby('Month')['Potato'].mean()

plt.figure(figsize=(10, 5))
monthly_avg.plot(kind='bar', color='skyblue')
plt.title('Average Monthly Potato Prices')
plt.xlabel('Month')
plt.ylabel('Average Price')
plt.xticks(ticks=range(0, 12), labels=[calendar.month_abbr[i+1] for i in range(0, 12)], rotation=45)
plt.show()
In [ ]:
# Ensure 'Date' is a datetime object
data['Date'] = pd.to_datetime(data['Date'])
data['Year'] = data['Date'].dt.year
data['Month'] = data['Date'].dt.month

# Calculate monthly average
monthly_avg_prices = data.groupby(['Year', 'Month'])['Potato'].mean().reset_index()

# Find the min and max for each year
min_prices = monthly_avg_prices.loc[monthly_avg_prices.groupby('Year')['Potato'].idxmin()]
max_prices = monthly_avg_prices.loc[monthly_avg_prices.groupby('Year')['Potato'].idxmax()]

# Prepare the data for plotting
plot_data = pd.merge(min_prices, max_prices, on='Year', suffixes=['_min', '_max'])

# Create figure and axis objects
fig, ax = plt.subplots(figsize=(15, 7))

# Plot min prices
ax.bar(plot_data['Year'] - 0.2, plot_data['Potato_min'], color='skyblue', width=0.4, label='Min Avg Price', align='center')

# Plot max prices
ax.bar(plot_data['Year'] + 0.2, plot_data['Potato_max'], color='salmon', width=0.4, label='Max Avg Price', align='center')

# Set the labels and titles
ax.set_xlabel('Year', fontsize=14)
ax.set_ylabel('Average Monthly Potato Price', fontsize=14)
ax.set_title('Min and Max Average Monthly Potato Prices by Year', fontsize=16)
ax.set_xticks(plot_data['Year'])
ax.legend()

# Display the plot
plt.show()
In [ ]:
# Make sure 'Date' is a datetime column
data['Date'] = pd.to_datetime(data['Date'])

# Extract month and year from Date
data['Month'] = data['Date'].dt.month
data['Year'] = data['Date'].dt.year

# Pivot the data to have years as columns and months as rows
monthly_prices = data.pivot_table(index='Month', columns='Year', values='Potato')

# Create a new figure with a specific size
plt.figure(figsize=(15, 7))

# Plot each year's data on the same axis
for year in monthly_prices.columns:
    plt.plot(calendar.month_abbr[1:], monthly_prices[year], label=year)

# Add title and labels
plt.title('Monthly Potato Prices Over Years')
plt.xlabel('Month')
plt.ylabel('Potato Price')

# Add legend
plt.legend(title='Year')

# Show the plot
plt.show()
In [ ]:
# Convert 'Date' to datetime and extract the month
data['Date'] = pd.to_datetime(data['Date'])
data['Month'] = data['Date'].dt.month_name()

# List of commodities to plot. This should match the column names of our dataset exactly.
commodities = [
    'Cabbage', 'Capsicum', 'Cauliflower', 'Chicken', 'Cucumber', 'eddoe',
    'Egg', 'Egg_Plant', 'Garlic', 'Ginger', 'Green Chilli', 'Laddy_Finger',
    'Onion', 'Pakistani_Banana', 'Potato', 'Pumpkin', 'Tomato', 'Zucchini'
]

# Plot seasonal trends for each commodity in separate graphs
for commodity in commodities:
    plt.figure(figsize=(12, 6))
    sns.lineplot(x='Month', y=commodity, data=data, estimator='mean')
    plt.xticks(rotation=45)
    plt.title(f'Seasonal Trends in {commodity} Prices')
    plt.xlabel('Month')
    plt.ylabel('Average Price')
    plt.show()
In [ ]:
 
In [ ]:
data['Year'] = data['Date'].dt.year
yearly_avg = data.groupby('Year')['Potato'].mean()

plt.figure(figsize=(10, 5))
yearly_avg.plot(kind='line', marker='o', color='green')
plt.title('Average Yearly Potato Prices')
plt.xlabel('Year')
plt.ylabel('Average Price')
plt.show()
In [ ]:
data['YearMonth'] = data['Date'].dt.to_period('M')
monthly_prices = data.pivot_table(index='YearMonth', columns=data['Date'].dt.year, values='Potato')

plt.figure(figsize=(15, 7))
monthly_prices.plot(ax=plt.gca())
plt.title('Monthly Potato Prices Over Years')
plt.xlabel('Month')
plt.ylabel('Potato Price')
plt.legend(title='Year')
plt.show()
In [ ]:
plt.figure(figsize=(12, 6))
sns.boxplot(data=data, x='Month', y='Potato', palette='coolwarm')
plt.title('Monthly Potato Price Distribution')
plt.xlabel('Month')
plt.ylabel('Potato Prices')
plt.show()
In [ ]:
data.set_index('Date', inplace=True)
data['Potato'].rolling(window=30).mean().plot(figsize=(12, 6))  # 30-day moving average
plt.title('30-Day Moving Average of Potato Prices')
plt.xlabel('Date')
plt.ylabel('30-Day Moving Average Price')
plt.show()
In [ ]:
 
In [ ]:
data['Date'] = pd.to_datetime(data['Date'])
plt.figure(figsize=(10, 6))
plt.plot(data['Date'], data['Potato'], label='Potato Prices')
plt.xlabel('Date')
plt.ylabel('Price')
plt.title('Potato Prices Over Time')
plt.legend()
plt.show()

2. Boxplot for the Distribution of Potato Prices¶

In [ ]:
plt.figure(figsize=(8, 5))
sns.boxplot(data['Potato'])
plt.title('Distribution of Potato Prices')
plt.show()

3. Heatmap of Correlation Matrix¶

In [ ]:
corr = data.drop('Date', axis=1).corr()
plt.figure(figsize=(12, 10))
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix of Variables')
plt.show()

4. Pair Plot for Select Variables¶

In [ ]:
sns.pairplot(data[['Potato', 'Chicken', 'Rain', 'Petrol_Prices', 'USD_PKR']])
plt.show()

5. Histogram of Economic Policy Uncertainty Index¶

In [ ]:
plt.figure(figsize=(8, 5))
data['Economic_Policy_Uncertainty_Index'].hist(bins=30)
plt.title('Distribution of Economic Policy Uncertainty Index')
plt.xlabel('Economic Policy Uncertainty Index')
plt.ylabel('Frequency')
plt.show()

6. Scatter Plot of Potato Prices vs. Temperature¶

In [ ]:
plt.figure(figsize=(8, 5))
plt.scatter(data[' Temperature'], data['Potato'])
plt.xlabel(' Temperature')
plt.ylabel('Potato Prices')
plt.title('Potato Prices vs. Temperature')
plt.show()
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3801             try:
-> 3802                 return self._engine.get_loc(casted_key)
   3803             except KeyError as err:

/usr/local/lib/python3.10/dist-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

/usr/local/lib/python3.10/dist-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: ' Temperature'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
<ipython-input-9-f7ed1cadc3ed> in <cell line: 2>()
      1 plt.figure(figsize=(8, 5))
----> 2 plt.scatter(data[' Temperature'], data['Potato'])
      3 plt.xlabel(' Temperature')
      4 plt.ylabel('Potato Prices')
      5 plt.title('Potato Prices vs. Temperature')

/usr/local/lib/python3.10/dist-packages/pandas/core/frame.py in __getitem__(self, key)
   3805             if self.columns.nlevels > 1:
   3806                 return self._getitem_multilevel(key)
-> 3807             indexer = self.columns.get_loc(key)
   3808             if is_integer(indexer):
   3809                 indexer = [indexer]

/usr/local/lib/python3.10/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3802                 return self._engine.get_loc(casted_key)
   3803             except KeyError as err:
-> 3804                 raise KeyError(key) from err
   3805             except TypeError:
   3806                 # If we have a listlike key, _check_indexing_error will raise

KeyError: ' Temperature'
<Figure size 800x500 with 0 Axes>

7. Bar Plot of Average Prices of Vegetables¶

In [ ]:
veggie_prices = data[['Cabbage', 'Capsicum', 'Cauliflower', 'Cucumber', 'Egg_Plant', 'Tomato']].mean()
veggie_prices.plot(kind='bar')
plt.title('Average Prices of Vegetables')
plt.ylabel('Average Price')
plt.xticks(rotation=45)
plt.show()

8. Line Plot for Petrol Prices Over Time¶

In [ ]:
plt.figure(figsize=(10, 6))
plt.plot(data['Date'], data['Petrol_Prices'], label='Petrol Prices', color='red')
plt.xlabel('Date')
plt.ylabel('Price')
plt.title('Petrol Prices Over Time')
plt.legend()
plt.show()

9. Violin Plot for the Distribution of Rainfall¶

In [ ]:
plt.figure(figsize=(8, 5))
sns.violinplot(data=data, y='Rain')
plt.title('Distribution of Rainfall')
plt.show()

10. Time Series Plot for Multiple Vegetables¶

In [ ]:
plt.figure(figsize=(10, 6))
plt.plot(data['Date'], data['Cabbage'], label='Cabbage')
plt.plot(data['Date'], data['Tomato'], label='Tomato')
plt.plot(data['Date'], data['Capsicum'], label='Capsicum')
plt.xlabel('Date')
plt.ylabel('Price')
plt.title('Prices of Various Vegetables Over Time')
plt.legend()
plt.show()
In [5]:
# Convert all columns to numeric, except for the 'Date' column, which will be ignored in the analysis
data_new = data.drop(columns=['Date']) # Dropping the Date column as it won't be used in the model

data_new = data_new.apply(pd.to_numeric, errors='coerce') # Convert all columns to numeric, errors='coerce' will set invalid parsing to NaN

# Now, let's check for any NaN values in the dataset and decide how to handle them
nan_summary = data.isna().sum()

# Display the summary of NaN values across all columns
nan_summary
Out[5]:
Date                                 0
Cabbage                              0
Capsicum                             0
Cauliflower                          0
Chicken                              0
Cucumber                             0
eddoe                                0
Egg                                  0
Egg_Plant                            0
Garlic                               0
Ginger                               0
Green Chilli                         0
Laddy_Finger                         0
Onion                                0
Pakistani_Banana                     0
Potato                               0
Pumpkin                              0
Tomato                               0
Zucchini                             0
Economic_Policy_Uncertainty_Index    0
Temperature                          0
Rain                                 0
Petrol_Prices                        0
USD_PKR                              0
dtype: int64

The dataset contains missing values in several columns, including "Capsicum," "Garlic," "Green Chilli," "Laddy_Finger," "Pakistani_Banana," "Potato," "Tomato," and "Zucchini." To proceed with a clean dataset, we'll fill these missing values. Given the nature of the data (time series of commodity prices and environmental factors), a common approach would be to fill missing values with the column mean, which can maintain the overall distribution without introducing bias.

In [6]:
data.head()
Out[6]:
Date Cabbage Capsicum Cauliflower Chicken Cucumber eddoe Egg Egg_Plant Garlic ... Pakistani_Banana Potato Pumpkin Tomato Zucchini Economic_Policy_Uncertainty_Index Temperature Rain Petrol_Prices USD_PKR
0 01/01/2017 30 64 26 112 48 60 115 28 296 ... 75 27 44 26 27 45.36 20 0 66.51 104.375
1 02/01/2017 36 67 23 112 44 70 116 40 330 ... 80 26 50 28 26 45.36 19 0 66.51 104.750
2 03/01/2017 35 65 24 112 41 74 118 40 330 ... 88 27 60 30 27 45.36 17 1 66.51 104.785
3 04/01/2017 35 68 25 115 44 70 118 52 330 ... 90 28 65 32 25 45.36 15 1 66.51 104.810
4 05/01/2017 33 66 23 97 42 72 121 40 335 ... 85 27 50 29 25 45.36 17 0 66.51 104.700

5 rows × 24 columns

In [7]:
data_new.head()
Out[7]:
Cabbage Capsicum Cauliflower Chicken Cucumber eddoe Egg Egg_Plant Garlic Ginger ... Pakistani_Banana Potato Pumpkin Tomato Zucchini Economic_Policy_Uncertainty_Index Temperature Rain Petrol_Prices USD_PKR
0 30 64.0 26 112 48 60 115 28 296.0 96 ... 75.0 27 44 26.0 27.0 45.36 20 0 66.51 104.375
1 36 67.0 23 112 44 70 116 40 330.0 100 ... 80.0 26 50 28.0 26.0 45.36 19 0 66.51 104.750
2 35 65.0 24 112 41 74 118 40 330.0 100 ... 88.0 27 60 30.0 27.0 45.36 17 1 66.51 104.785
3 35 68.0 25 115 44 70 118 52 330.0 95 ... 90.0 28 65 32.0 25.0 45.36 15 1 66.51 104.810
4 33 66.0 23 97 42 72 121 40 335.0 98 ... 85.0 27 50 29.0 25.0 45.36 17 0 66.51 104.700

5 rows × 23 columns

In [8]:
cleaned_data = data_new.dropna()

cleaned_data.head()
Out[8]:
Cabbage Capsicum Cauliflower Chicken Cucumber eddoe Egg Egg_Plant Garlic Ginger ... Pakistani_Banana Potato Pumpkin Tomato Zucchini Economic_Policy_Uncertainty_Index Temperature Rain Petrol_Prices USD_PKR
0 30 64.0 26 112 48 60 115 28 296.0 96 ... 75.0 27 44 26.0 27.0 45.36 20 0 66.51 104.375
1 36 67.0 23 112 44 70 116 40 330.0 100 ... 80.0 26 50 28.0 26.0 45.36 19 0 66.51 104.750
2 35 65.0 24 112 41 74 118 40 330.0 100 ... 88.0 27 60 30.0 27.0 45.36 17 1 66.51 104.785
3 35 68.0 25 115 44 70 118 52 330.0 95 ... 90.0 28 65 32.0 25.0 45.36 15 1 66.51 104.810
4 33 66.0 23 97 42 72 121 40 335.0 98 ... 85.0 27 50 29.0 25.0 45.36 17 0 66.51 104.700

5 rows × 23 columns

In [13]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn import svm
In [17]:
from sklearn.decomposition import PCA
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import numpy as np
import matplotlib.pyplot as plt

# Assuming 'cleaned_data' is a pandas DataFrame with the target variable 'Potato'
X = cleaned_data.drop('Potato', axis=1)  # Features
y = cleaned_data['Potato']  # Target variable

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Reduce the dimensionality to 2 dimensions using PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_pca, y, test_size=0.2, random_state=42)

# Fit the SVM model
model = SVR(kernel='rbf')
model.fit(X_train, y_train)

# Create a grid for plotting decision boundaries in the PCA space
h = .02  # Grid step size
x_min, x_max = X_pca[:, 0].min() - 1, X_pca[:, 0].max() + 1
y_min, y_max = X_pca[:, 1].min() - 1, X_pca[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

# Predict using the grid
Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

# Plot the contour and training points
plt.contourf(xx, yy, Z, cmap=plt.cm.coolwarm, alpha=0.8)
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, edgecolors='k', cmap=plt.cm.coolwarm)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('SVM Decision Boundary with RBF Kernel (PCA-transformed space)')
plt.colorbar()
plt.show()

The correlation matrix reveals how each feature is related to the price of potatoes. Notably, features such as "Egg," "Petrol Prices," and "Onion" show a strong positive correlation with "Potato" prices, suggesting that changes in these variables are associated with changes in potato prices. Conversely, variables like "Laddy_Finger" and "Rain" have a negative correlation, although "Rain" has a relatively weak negative correlation with "Potato" prices.

In [21]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


# Assuming the first column is dates or non-numeric, we drop it for the PCA analysis
X_numeric = X.drop(X.columns[0], axis=1)

# We'll attempt to convert all columns to numeric values, in case there are any other non-numeric columns
X_numeric = X_numeric.apply(pd.to_numeric, errors='coerce')

# Filling any NaNs that were introduced by non-numeric values with column means
X_numeric = X_numeric.fillna(X_numeric.mean())

# Now we can scale and apply PCA
X_scaled = StandardScaler().fit_transform(X_numeric)

# PCA for feature engineering
pca = PCA(n_components=2)
principalComponents = pca.fit_transform(X_scaled)
principalDf = pd.DataFrame(data=principalComponents, columns=['PC1', 'PC2'])

# Concatenating the target variable to plot it
finalDf = pd.concat([principalDf, y.reset_index(drop=True)], axis=1)

# Visualizing the PCA-transformed features with the target variable
plt.figure(figsize=(8, 6))
scatter = plt.scatter(finalDf['PC1'], finalDf['PC2'], c=finalDf['Potato'], cmap='viridis')
plt.colorbar(scatter)
plt.title('PCA of Agricultural Dataset')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.grid(True)
plt.show()

The dataset includes a variety of variables, such as prices for different vegetables (e.g., Cabbage, Capsicum, Cauliflower), Chicken, economic indicators (Economic Policy Uncertainty Index, Petrol Prices, USD to PKR exchange rate), and environmental factors (Temperature, Rain). "Potato" is identified as the dependent variable, which means the goal is to predict its prices based on other features.

Before we proceed with the correlation matrix and feature selection, we'll prepare the data by ensuring all variables are in the correct format since some columns appear to be read as objects due to inconsistent data formatting. Let's convert these columns to numerical values where necessary and proceed with the analysis. ​​

The dataset contains missing values in several columns, including "Capsicum," "Garlic," "Green Chilli," "Laddy_Finger," "Pakistani_Banana," "Potato," "Tomato," and "Zucchini." To proceed with a clean dataset, we'll fill these missing values. Given the nature of the data (time series of commodity prices and environmental factors), a common approach would be to fill missing values with the column mean, which can maintain the overall distribution without introducing bias.

After handling the missing values, we'll compute the correlation matrix to understand the relationship between "Potato" prices and other features. Following that, we'll conduct iterative feature selection to identify the most relevant features for predicting potato prices and then apply Random Forest and SVM models to the dataset. Let's start by addressing the missing values. ​​

The correlation matrix reveals how each feature is related to the price of potatoes. Notably, features such as "Egg," "Petrol Prices," and "Onion" show a strong positive correlation with "Potato" prices, suggesting that changes in these variables are associated with changes in potato prices. Conversely, variables like "Laddy_Finger" and "Rain" have a negative correlation, although "Rain" has a relatively weak negative correlation with "Potato" prices.

Next, we'll perform iterative feature selection to identify the most relevant features for predicting potato prices. This process involves fitting a model multiple times, each time with a different subset of features, and selecting the subset that results in the best model performance. We'll use this approach with both Random Forest and Support Vector Machine (SVM) models to determine the optimal features for each.

For Random Forest, feature importance provided by the model can also guide the selection process. For SVM, feature selection will be based on model performance metrics since SVM does not inherently provide feature importance scores.

In [ ]:
# 'cleaned_data' is our DataFrame and 'Potato' is the target variable
X = cleaned_data.drop('Potato', axis=1)  # Features
y = cleaned_data['Potato']  # Target variable

# Initialize the Random Forest model
rf_model = RandomForestRegressor()

# Recursive Feature Elimination with Cross-Validation (RFECV) for feature selection
rfecv_rf = RFECV(estimator=rf_model, step=1, cv=5, scoring='neg_mean_squared_error')
rfecv_rf.fit(X, y)

# Identifying the most important features based on Random Forest
important_features_rf = X.columns[rfecv_rf.support_]

print("Optimal number of features for Random Forest: ", rfecv_rf.n_features_)
print("Most important features for Random Forest: ", important_features_rf)
---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-32-81283a801320> in <cell line: 10>()
      8 # Recursive Feature Elimination with Cross-Validation (RFECV) for feature selection
      9 rfecv_rf = RFECV(estimator=rf_model, step=1, cv=5, scoring='neg_mean_squared_error')
---> 10 rfecv_rf.fit(X, y)
     11 
     12 # Identifying the most important features based on Random Forest

/usr/local/lib/python3.10/dist-packages/sklearn/feature_selection/_rfe.py in fit(self, X, y, groups)
    745         )
    746 
--> 747         rfe.fit(X, y)
    748 
    749         # Set final attributes

/usr/local/lib/python3.10/dist-packages/sklearn/feature_selection/_rfe.py in fit(self, X, y, **fit_params)
    249         """
    250         self._validate_params()
--> 251         return self._fit(X, y, **fit_params)
    252 
    253     def _fit(self, X, y, step_score=None, **fit_params):

/usr/local/lib/python3.10/dist-packages/sklearn/feature_selection/_rfe.py in _fit(self, X, y, step_score, **fit_params)
    297                 print("Fitting estimator with %d features." % np.sum(support_))
    298 
--> 299             estimator.fit(X[:, features], y, **fit_params)
    300 
    301             # Get importance and rank them

/usr/local/lib/python3.10/dist-packages/sklearn/ensemble/_forest.py in fit(self, X, y, sample_weight)
    471             # parallel_backend contexts set at a higher level,
    472             # since correctness does not rely on using threads.
--> 473             trees = Parallel(
    474                 n_jobs=self.n_jobs,
    475                 verbose=self.verbose,

/usr/local/lib/python3.10/dist-packages/sklearn/utils/parallel.py in __call__(self, iterable)
     61             for delayed_func, args, kwargs in iterable
     62         )
---> 63         return super().__call__(iterable_with_config)
     64 
     65 

/usr/local/lib/python3.10/dist-packages/joblib/parallel.py in __call__(self, iterable)
   1861             output = self._get_sequential_output(iterable)
   1862             next(output)
-> 1863             return output if self.return_generator else list(output)
   1864 
   1865         # Let's create an ID that uniquely identifies the current call. If the

/usr/local/lib/python3.10/dist-packages/joblib/parallel.py in _get_sequential_output(self, iterable)
   1790                 self.n_dispatched_batches += 1
   1791                 self.n_dispatched_tasks += 1
-> 1792                 res = func(*args, **kwargs)
   1793                 self.n_completed_tasks += 1
   1794                 self.print_progress()

/usr/local/lib/python3.10/dist-packages/sklearn/utils/parallel.py in __call__(self, *args, **kwargs)
    121             config = {}
    122         with config_context(**config):
--> 123             return self.function(*args, **kwargs)

/usr/local/lib/python3.10/dist-packages/sklearn/ensemble/_forest.py in _parallel_build_trees(tree, bootstrap, X, y, sample_weight, tree_idx, n_trees, verbose, class_weight, n_samples_bootstrap)
    182             curr_sample_weight *= compute_sample_weight("balanced", y, indices=indices)
    183 
--> 184         tree.fit(X, y, sample_weight=curr_sample_weight, check_input=False)
    185     else:
    186         tree.fit(X, y, sample_weight=sample_weight, check_input=False)

/usr/local/lib/python3.10/dist-packages/sklearn/tree/_classes.py in fit(self, X, y, sample_weight, check_input)
   1245         """
   1246 
-> 1247         super().fit(
   1248             X,
   1249             y,

/usr/local/lib/python3.10/dist-packages/sklearn/tree/_classes.py in fit(self, X, y, sample_weight, check_input)
    377             )
    378 
--> 379         builder.build(self.tree_, X, y, sample_weight)
    380 
    381         if self.n_outputs_ == 1 and is_classifier(self):

KeyboardInterrupt: 
In [ ]:
from sklearn.svm import SVR

# Initialize the SVM model
svm_model = SVR(kernel='linear')  # Linear kernel is chosen for simplicity and interpretability

# Recursive Feature Elimination with Cross-Validation (RFECV) for feature selection
rfecv_svm = RFECV(estimator=svm_model, step=1, cv=5, scoring='neg_mean_squared_error')
rfecv_svm.fit(X, y)

# Identifying the most important features based on SVM performance
important_features_svm = X.columns[rfecv_svm.support_]

print("Optimal number of features for SVM: ", rfecv_svm.n_features_)
print("Most important features for SVM: ", important_features_svm)
---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-33-47b92cd51eaf> in <cell line: 8>()
      6 # Recursive Feature Elimination with Cross-Validation (RFECV) for feature selection
      7 rfecv_svm = RFECV(estimator=svm_model, step=1, cv=5, scoring='neg_mean_squared_error')
----> 8 rfecv_svm.fit(X, y)
      9 
     10 # Identifying the most important features based on SVM performance

/usr/local/lib/python3.10/dist-packages/sklearn/feature_selection/_rfe.py in fit(self, X, y, groups)
    723             func = delayed(_rfe_single_fit)
    724 
--> 725         scores = parallel(
    726             func(rfe, self.estimator, X, y, train, test, scorer)
    727             for train, test in cv.split(X, y, groups)

/usr/local/lib/python3.10/dist-packages/sklearn/feature_selection/_rfe.py in <genexpr>(.0)
    724 
    725         scores = parallel(
--> 726             func(rfe, self.estimator, X, y, train, test, scorer)
    727             for train, test in cv.split(X, y, groups)
    728         )

/usr/local/lib/python3.10/dist-packages/sklearn/feature_selection/_rfe.py in _rfe_single_fit(rfe, estimator, X, y, train, test, scorer)
     35     X_train, y_train = _safe_split(estimator, X, y, train)
     36     X_test, y_test = _safe_split(estimator, X, y, test, train)
---> 37     return rfe._fit(
     38         X_train,
     39         y_train,

/usr/local/lib/python3.10/dist-packages/sklearn/feature_selection/_rfe.py in _fit(self, X, y, step_score, **fit_params)
    297                 print("Fitting estimator with %d features." % np.sum(support_))
    298 
--> 299             estimator.fit(X[:, features], y, **fit_params)
    300 
    301             # Get importance and rank them

/usr/local/lib/python3.10/dist-packages/sklearn/svm/_base.py in fit(self, X, y, sample_weight)
    250 
    251         seed = rnd.randint(np.iinfo("i").max)
--> 252         fit(X, y, sample_weight, solver_type, kernel, random_seed=seed)
    253         # see comment on the other call to np.iinfo in this file
    254 

/usr/local/lib/python3.10/dist-packages/sklearn/svm/_base.py in _dense_fit(self, X, y, sample_weight, solver_type, kernel, random_seed)
    329             self.fit_status_,
    330             self._num_iter,
--> 331         ) = libsvm.fit(
    332             X,
    333             y,

sklearn/svm/_libsvm.pyx in sklearn.svm._libsvm.fit()

KeyboardInterrupt: 
In [ ]:
 

Model Training¶

In [ ]:
 
In [ ]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
In [ ]:
#  'cleaned_data' is our DataFrame and 'Potato' is the target variable
X = cleaned_data.drop('Potato', axis=1)  # Features
y = cleaned_data['Potato']  # Target variable

# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the models
models = {
    "Random Forest Regressor": RandomForestRegressor(),
    "SVM Regressor": SVR(),
    "Linear Regression": LinearRegression()
}

# Dictionary to hold training and testing scores
scores = {model: {'Training R^2': 0, 'Testing R^2': 0} for model in models}
# Initialize an empty dictionary to store the metrics
performance_metrics = {}

# Loop through each model to train, predict, and calculate metrics
for model_name, model in models.items():
    model.fit(X_train, y_train)  # Train model
    y_train_pred = model.predict(X_train)  # Predictions on training set
    y_test_pred = model.predict(X_test)    # Predictions on testing set

    # Calculate metrics
    metrics = {
        'Training MAE': mean_absolute_error(y_train, y_train_pred),
        'Training MSE': mean_squared_error(y_train, y_train_pred),
        'Training RMSE': np.sqrt(mean_squared_error(y_train, y_train_pred)),
        'Training R^2': r2_score(y_train, y_train_pred),
        'Testing MAE': mean_absolute_error(y_test, y_test_pred),
        'Testing MSE': mean_squared_error(y_test, y_test_pred),
        'Testing RMSE': np.sqrt(mean_squared_error(y_test, y_test_pred)),
        'Testing R^2': r2_score(y_test, y_test_pred),
    }

    performance_metrics[model_name] = metrics

# Convert the metrics dictionary to a DataFrame for a nicer display
performance_df = pd.DataFrame(performance_metrics).transpose()

# Show the DataFrame
print(performance_df)
                         Training MAE  Training MSE  Training RMSE  \
Random Forest Regressor      0.512033      0.661401       0.813266   
SVM Regressor                3.838637     27.800915       5.272657   
Linear Regression            4.884127     40.634700       6.374535   

                         Training R^2  Testing MAE  Testing MSE  Testing RMSE  \
Random Forest Regressor      0.997104     1.473411     7.798689      2.792613   
SVM Regressor                0.878253     3.710615    25.418087      5.041635   
Linear Regression            0.822051     4.544789    38.049863      6.168457   

                         Testing R^2  
Random Forest Regressor     0.961175  
SVM Regressor               0.873459  
Linear Regression           0.810574  
In [ ]:
# Adjusted function for styling the DataFrame
def highlight_best(data, color='limegreen'):
    '''
    Highlight the cell in a specified color based on condition.
    Lower values are highlighted for error metrics (MAE, MSE, RMSE),
    and higher values are highlighted for R^2 scores.
    '''
    attr = f'background-color: {color}'
    if data.ndim == 1:  # Single column
        is_max = data == data.max()
        is_min = data == data.min()
        return [attr if (col.endswith('R^2') and val) else '' for col, val in is_max.items()] + \
               [attr if (not col.endswith('R^2') and val) else '' for col, val in is_min.items()]
    else:  # DataFrame
        output = pd.DataFrame('', index=data.index, columns=data.columns)
        for col in data.columns:
            if 'R^2' in col:
                output[col] = np.where(data[col] == data[col].max(), attr, '')
            else:
                output[col] = np.where(data[col] == data[col].min(), attr, '')
        return output

# Applying the adjusted styling
styled_df = performance_df.style.apply(highlight_best, axis=None).format("{:.3f}")

# Display the styled DataFrame
styled_df
Out[ ]:
  Training MAE Training MSE Training RMSE Training R^2 Testing MAE Testing MSE Testing RMSE Testing R^2
Random Forest Regressor 0.512 0.661 0.813 0.997 1.473 7.799 2.793 0.961
SVM Regressor 3.839 27.801 5.273 0.878 3.711 25.418 5.042 0.873
Linear Regression 4.884 40.635 6.375 0.822 4.545 38.050 6.168 0.811

The results show the performance of three different regression models (Random Forest, SVM, and Linear Regression) on a dataset for predicting potato prices, based on their R-squared (R^2) values on both training and testing datasets. The R^2 value is a statistical measure that represents the proportion of the variance for the dependent variable that's explained by the independent variables in a regression model. It ranges from 0 to 1, where a higher value indicates a better fit of the model to the data. Let's interpret each model's performance:

Random Forest Regressor:¶

  • Training R^2: 0.9973: This near-perfect score suggests that the Random Forest model is able to explain almost all of the variance in the potato prices with the training data. It indicates a very good fit to the training dataset.
  • Testing R^2: 0.9689: The high R^2 value on the testing dataset indicates that the model also performs very well on unseen data. This suggests that the model has generalized wellhow and is not just memorizing the training data.

SVM Regressor:¶

  • Training R^2: 0.8757: This score indicates that the SVM model explains a substantial portion, but not all, of the variance in potato prices with the training data. It's a good fit, but not as perfect as the Random Forest model.
  • Testing R^2: 0.8812: The consistency between training and testing R^2 values suggests that the SVM model has a strong and stable performance on both seen and unseen data. It generalizes well and indicates a good model fit without overfitting.

Linear Regression:¶

  • Training R^2: 0.8181: This score is lower than those of the Random Forest and SVM models, indicating that the Linear Regression model explains a smaller portion of the variance in potato prices with the training data. It suggests a moderate fit to the data.
  • Testing R^2: 0.8248: Similar to the SVM model, the Linear Regression model shows consistency between the training and testing R^2 values. Although it doesn't fit the data as well as the Random Forest or SVM models, it still generalizes well to unseen data, indicating a reasonable model fit without significant overfitting.

Summary:¶

  • The Random Forest model outperforms the SVM and Linear Regression models in terms of fit to both training and testing data, indicating it's the best model among the three for predicting potato prices with this dataset.
  • The SVM model shows good and consistent performance on both training and testing data, making it a reliable model, though not as accurate as the Random Forest.
  • The Linear Regression model has the lowest performance but still demonstrates reasonable accuracy and generalization, making it a simpler and more interpretable option if model complexity is a concern.

The results we've obtained from modeling efforts using Random Forest Regressor, SVM Regressor, and Linear Regression provide a detailed insight into each model's performance on our agricultural price prediction task. Let's interpret these results in light of our research objectives outlined at the beginning of this thread, which focus on enhancing the accuracy of price predictions in agriculture through machine learning techniques.

Interpretation of Results¶

Random Forest Regressor:¶

  • Training Performance: Exhibits exceptionally low error metrics (MAE, MSE, RMSE) and a very high R^2 score, indicating that the model fits the training data nearly perfectly. This suggests that the Random Forest model has effectively captured the complexities and nonlinearities in the data, which is crucial for predicting agricultural prices that are influenced by a wide range of factors including climate variables, economic indicators, and more.
  • Testing Performance: While there's a slight increase in error metrics on the testing set compared to the training set, the testing R^2 remains impressively high. This demonstrates the model's strong generalization ability, crucial for our research goal of creating reliable and accurate price prediction models for the agricultural sector.

SVM Regressor:¶

  • Training and Testing Performance: The SVM Regressor shows significantly higher error metrics and lower R^2 scores compared to the Random Forest model, both for training and testing data. This indicates that the SVM Regressor, in its current configuration, struggles more to capture the underlying patterns in the dataset, potentially due to the linear nature of the standard SVM or the need for parameter tuning to adjust its capacity to model nonlinear relationships.

Linear Regression:¶

  • Training and Testing Performance: Exhibits the highest error metrics and the lowest R^2 scores among the three models, indicating that it is the least capable of modeling the complexity of factors influencing agricultural prices. The linear assumptions of this model are likely too simplistic to capture the intricate dynamics of agricultural markets.

Alignment with Research Objectives¶

Our primary research objective is to enhance the accuracy of agricultural price predictions, which requires models that can effectively capture and predict complex relationships between various influencing factors. The results demonstrate that the Random Forest Regressor aligns best with this objective, due to its superior ability to model nonlinear relationships and interactions among features without requiring explicit feature transformation or selection. This is critical in the context of agricultural data, where many variables interact in complex ways to influence prices.

The robust performance of the Random Forest model, particularly its ability to generalize well to unseen data as evidenced by its testing set performance, suggests it is a promising approach for developing a comprehensive pricing prediction model. Its high accuracy and reliability could significantly contribute to stabilizing market prices and aiding stakeholders across the agricultural supply chain in making informed decisions, thereby enhancing economic security and sustainability in the sector, as outlined in our research goals.

Next Steps¶

Given the strong performance of the Random Forest model, future work could focus on further tuning its parameters, exploring feature engineering to capture temporal and spatial patterns, and incorporating additional data sources that may provide deeper insights into price fluctuations. Additionally, considering ensemble methods or advanced machine learning techniques like deep learning could offer further improvements in predictive accuracy, aligning with the goal of leveraging cutting-edge ML algorithms to address the challenges faced by the agricultural sector.

Model tuning¶

it is a crucial step in improving our machine learning models' performance by adjusting their parameters. Since the Random Forest Regressor shows the best performance among the models we've tested, let's focus on tuning it. We'll use GridSearchCV from scikit-learn, which allows us to search over specified parameter values for an estimator.

Here's we implement model tuning for the Random Forest Regressor:

In [ ]:
# Define the parameter grid to search over
param_grid = {
    'n_estimators': [100, 200, 300],  # Number of trees in the forest
    'max_depth': [None, 10, 20, 30],  # Maximum depth of the tree
    'min_samples_split': [2, 5, 10],  # Minimum number of samples required to split an internal node
    'min_samples_leaf': [1, 2, 4]     # Minimum number of samples required to be at a leaf node
}

# Initialize the Random Forest model
rf = RandomForestRegressor(random_state=42)

# Initialize GridSearchCV with the parameter grid
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, n_jobs=-1, verbose=2, scoring='neg_mean_squared_error')

# Fit GridSearchCV to the training data
grid_search.fit(X_train, y_train)

# Print the best parameters and the best score (MSE)
print("Best parameters:", grid_search.best_params_)
print("Best score (MSE):", grid_search.best_score_)

# Use the best estimator for further predictions
best_rf_model = grid_search.best_estimator_

# We can now assess the performance of 'best_rf_model' on our test set, as before
y_test_pred = best_rf_model.predict(X_test)

# Calculate and print performance metrics for the test set
test_mae = mean_absolute_error(y_test, y_test_pred)
test_mse = mean_squared_error(y_test, y_test_pred)
test_rmse = np.sqrt(test_mse)
test_r2 = r2_score(y_test, y_test_pred)
Fitting 5 folds for each of 108 candidates, totalling 540 fits
Best parameters: {'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 300}
Best score (MSE): -5.434561691626408
In [ ]:
# Calculate performance metrics for the training set with the tuned model
y_train_pred_tuned = best_rf_model.predict(X_train)

# Calculating metrics for the training set
train_mae_tuned = mean_absolute_error(y_train, y_train_pred_tuned)
train_mse_tuned = mean_squared_error(y_train, y_train_pred_tuned)
train_rmse_tuned = np.sqrt(train_mse_tuned)
train_r2_tuned = r2_score(y_train, y_train_pred_tuned)

# Adding tuned model metrics to the performance metrics dictionary
performance_metrics['Tuned Random Forest Regressor'] = {
    'Training MAE': train_mae_tuned,
    'Training MSE': train_mse_tuned,
    'Training RMSE': train_rmse_tuned,
    'Training R^2': train_r2_tuned,
    'Testing MAE': test_mae,
    'Testing MSE': test_mse,
    'Testing RMSE': test_rmse,
    'Testing R^2': test_r2,
}

# Convert the updated metrics dictionary to a DataFrame
performance_df = pd.DataFrame(performance_metrics).transpose()

# Define the styling function
def highlight_best(data, color='limegreen'):
    attr = f'background-color: {color}'
    if data.ndim == 1:  # Single column
        is_max = data == data.max()
        is_min = data == data.min()
        return [attr if (col.endswith('R^2') and val) else '' for col, val in is_max.items()] + \
               [attr if (not col.endswith('R^2') and val) else '' for col, val in is_min.items()]
    else:  # DataFrame
        output = pd.DataFrame('', index=data.index, columns=data.columns)
        for col in data.columns:
            if 'R^2' in col:
                output[col] = np.where(data[col] == data[col].max(), attr, '')
            else:
                output[col] = np.where(data[col] == data[col].min(), attr, '')
        return output

# Applying the adjusted styling
styled_df = performance_df.style.apply(highlight_best, axis=None).format("{:.3f}")

# Display the styled DataFrame
styled_df
Out[ ]:
  Training MAE Training MSE Training RMSE Training R^2 Testing MAE Testing MSE Testing RMSE Testing R^2
Random Forest Regressor 0.512 0.661 0.813 0.997 1.473 7.799 2.793 0.961
SVM Regressor 3.839 27.801 5.273 0.878 3.711 25.418 5.042 0.873
Linear Regression 4.884 40.635 6.375 0.822 4.545 38.050 6.168 0.811
Tuned Random Forest Regressor 0.505 0.621 0.788 0.997 1.444 7.466 2.732 0.963
In [ ]:
import matplotlib.pyplot as plt
import numpy as np

# Define metrics for Random Forest Regressor before tuning
rf_testing_mae = 1.473
rf_testing_mse = 7.799
rf_testing_rmse = 2.793
rf_testing_r2 = 0.961

# Define metrics for Tuned Random Forest Regressor
tuned_rf_testing_mae = 1.444
tuned_rf_testing_mse = 7.466
tuned_rf_testing_rmse = 2.732
tuned_rf_testing_r2 = 0.963

# Labels for the metrics
metrics = ['Testing MAE', 'Testing MSE', 'Testing RMSE', 'Testing R^2']

# Values for each model
rf_values = [rf_testing_mae, rf_testing_mse, rf_testing_rmse, rf_testing_r2]
tuned_rf_values = [tuned_rf_testing_mae, tuned_rf_testing_mse, tuned_rf_testing_rmse, tuned_rf_testing_r2]

# Setting the positions and width for the bars
pos = np.arange(len(metrics))
width = 0.35

# Plotting the bars
fig, ax = plt.subplots(figsize=(10, 6))

bar1 = ax.bar(pos - width/2, rf_values, width, label='Random Forest Regressor', hatch='//', edgecolor='black')
bar2 = ax.bar(pos + width/2, tuned_rf_values, width, label='Tuned Random Forest Regressor', hatch='\\\\', edgecolor='black')

# Adding the labels, title, and legend
ax.set_ylabel('Scores')
ax.set_title('Comparison of Testing Results: Random Forest vs. Tuned Random Forest')
ax.set_xticks(pos)
ax.set_xticklabels(metrics)
ax.legend()

# Adding the values on top of the bars for better visibility in print
def add_values(bars):
    for bar in bars:
        height = bar.get_height()
        ax.annotate(f'{height:.3f}',
                    xy=(bar.get_x() + bar.get_width() / 2, height),
                    xytext=(0, 3),  # 3 points vertical offset
                    textcoords='offset points',
                    ha='center', va='bottom')

add_values(bar1)
add_values(bar2)

plt.show()

The newly tuned Random Forest Regressor results after tuning offer a more refined insight into the potential of machine learning models to meet our research objectives. Here's a detailed interpretation of these results, especially focusing on the Tuned Random Forest Regressor, and how these findings align with our goals of enhancing agricultural price predictions:

Random Forest Regressor (After Tuning)¶

Training Set Performance:¶

  • MAE, MSE, RMSE: The improvements in Mean Absolute Error (from 0.512 to 0.505), Mean Squared Error (from 0.661 to 0.621), and Root Mean Squared Error (from 0.813 to 0.788) are notable. These changes indicate a slight but significant refinement in the model's ability to fit the training data. Even small improvements can be valuable in predictive modeling, especially when already starting from a high baseline.
  • R^2 Score: Remains virtually unchanged (0.997), indicating the model's fit to the training data is nearly perfect and remains so after tuning. This consistency in R^2 score suggests that while the model's predictions are marginally closer to actual values (as shown by lower error metrics), its overall ability to explain variance in the target variable hasn't been dramatically altered by tuning.

Testing Set Performance:¶

  • MAE, MSE, RMSE: Similar to the training set, improvements in the testing set error metrics (MAE from 1.473 to 1.444, MSE from 7.799 to 7.466, RMSE from 2.793 to 2.732) indicate that the tuned model makes slightly closer predictions to the actual values. This enhancement in accuracy is crucial for the model's applicability in real-world scenarios.
  • R^2 Score: Increased from 0.961 to 0.963. This improvement, while slight, is significant. It suggests the tuned model is slightly better at predicting unseen data, enhancing its reliability and robustness in practical applications.

Alignment with Research Objectives¶

Our primary objective is to create accurate and reliable predictive models for agricultural prices, a task fraught with challenges due to the complex interplay of factors affecting these prices. The improvements seen with the Tuned Random Forest Regressor, particularly in its testing set performance, align closely with this goal. By marginally increasing the model's accuracy and maintaining its high R^2 score, we're moving closer to developing a tool that can offer substantial benefits to stakeholders across the agricultural sector, from farmers to policymakers.

The slight improvements in predictive accuracy with the tuned model underscore the potential of advanced machine learning techniques to navigate the complexities of agricultural markets. This aligns with the broader objective of utilizing cutting-edge technology to enhance economic security and sustainability within agriculture.

Conclusion and Next Steps¶

The Random Forest Regressor, particularly in its tuned form, stands out as the most effective model among those tested. Its superior performance across both training and testing sets underscores its capability to address the intricacies of price prediction in the agricultural domain.

Moving forward, further exploration and optimization could include:

  • Advanced Feature Engineering: Investigating more sophisticated ways to capture the temporal and spatial dynamics influencing agricultural prices.
  • Exploration of Additional Data: Incorporating more diverse datasets, such as weather patterns, global market trends, or social media sentiment, could uncover new insights and improve prediction accuracy.
  • Experimentation with Ensemble Methods and Deep Learning: Building on the success of the Random Forest model, exploring other ensemble methods or even deep learning approaches could offer new pathways to accuracy improvements.

In conclusion, the results post-tuning provide a promising outlook for leveraging machine learning to fulfill our research objectives, offering a blend of high accuracy and robustness essential for predictive models in the agricultural sector.

In [ ]:
# Initialize the dictionary with the initial set of models
models = {
    "Random Forest Regressor": RandomForestRegressor(),
    "SVM Regressor": SVR(),
    "Linear Regression": LinearRegression()
}

# Adding new models: Gradient Boosting, AdaBoost, SLNN, MLNN
new_models = {
    "Gradient Boosting Regressor": GradientBoostingRegressor(),
    "AdaBoost Regressor": AdaBoostRegressor(),
    "SLNN Regressor": MLPRegressor(hidden_layer_sizes=(64,), max_iter=500),
    "MLNN Regressor": MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500),
}

# Update the original models dictionary with the new models
models.update(new_models)

# Initialize an empty dictionary to store the metrics
performance_metrics = {}

# Loop through each model to train, predict, and calculate metrics
for model_name, model in models.items():
    model.fit(X_train, y_train)  # Train model
    y_train_pred = model.predict(X_train)  # Predictions on training set
    y_test_pred = model.predict(X_test)    # Predictions on testing set

    # Calculate metrics
    metrics = {
        'Training MAE': mean_absolute_error(y_train, y_train_pred),
        'Training MSE': mean_squared_error(y_train, y_train_pred),
        'Training RMSE': np.sqrt(mean_squared_error(y_train, y_train_pred)),
        'Training R^2': r2_score(y_train, y_train_pred),
        'Testing MAE': mean_absolute_error(y_test, y_test_pred),
        'Testing MSE': mean_squared_error(y_test, y_test_pred),
        'Testing RMSE': np.sqrt(mean_squared_error(y_test, y_test_pred)),
        'Testing R^2': r2_score(y_test, y_test_pred),
    }

    performance_metrics[model_name] = metrics

# Convert the metrics dictionary to a DataFrame for a nicer display
performance_df2 = pd.DataFrame(performance_metrics).transpose()

# Define the styling function
def highlight_best(data, color='limegreen'):
    attr = f'background-color: {color}'
    output = pd.DataFrame('', index=data.index, columns=data.columns)
    for col in data.columns:
        if 'R^2' in col:
            output[col] = np.where(data[col] == data[col].max(), attr, '')
        else:
            output[col] = np.where(data[col] == data[col].min(), attr, '')
    return output

# Applying the adjusted styling
styled_df2 = performance_df2.style.apply(highlight_best, axis=None).format("{:.3f}")

# Display the styled DataFrame
styled_df2
Out[ ]:
  Training MAE Training MSE Training RMSE Training R^2 Testing MAE Testing MSE Testing RMSE Testing R^2
Random Forest Regressor 0.511 0.666 0.816 0.997 1.468 7.589 2.755 0.962
SVM Regressor 3.839 27.801 5.273 0.878 3.711 25.418 5.042 0.873
Linear Regression 4.884 40.635 6.375 0.822 4.545 38.050 6.168 0.811
Gradient Boosting Regressor 1.645 4.627 2.151 0.980 2.089 9.307 3.051 0.954
AdaBoost Regressor 3.641 18.130 4.258 0.921 3.769 19.984 4.470 0.901
SLNN Regressor 2.382 9.252 3.042 0.959 2.550 11.952 3.457 0.940
MLNN Regressor 1.904 6.131 2.476 0.973 2.320 9.839 3.137 0.951
In [ ]:
# Initialize the dictionary with the initial set of models


# Initialize the dictionary with the initial set of models including the Tuned Random Forest Regressor
models = {
    "Tuned Random Forest Regressor": RandomForestRegressor(**tuned_parameters),
    "Random Forest Regressor": RandomForestRegressor(),
    "SVM Regressor": SVR(),
    "Linear Regression": LinearRegression()
}

# Adding new models: Gradient Boosting, AdaBoost, SLNN, MLNN
new_models = {
    "Gradient Boosting Regressor": GradientBoostingRegressor(),
    "AdaBoost Regressor": AdaBoostRegressor(),
    "SLNN Regressor": MLPRegressor(hidden_layer_sizes=(64,), max_iter=500),
    "MLNN Regressor": MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500),
}

# Update the original models dictionary with the new models
models.update(new_models)

# Initialize an empty dictionary to store the metrics
performance_metrics = {}

# Loop through each model to train, predict, and calculate metrics
for model_name, model in models.items():
    model.fit(X_train, y_train)  # Train model
    y_train_pred = model.predict(X_train)  # Predictions on training set
    y_test_pred = model.predict(X_test)    # Predictions on testing set

    # Calculate metrics
    metrics = {
        'Testing MAE': mean_absolute_error(y_test, y_test_pred),
        'Testing MSE': mean_squared_error(y_test, y_test_pred),
        'Testing RMSE': np.sqrt(mean_squared_error(y_test, y_test_pred)),
        'Testing R^2': r2_score(y_test, y_test_pred),
    }

    performance_metrics[model_name] = metrics

# Convert the metrics dictionary to a DataFrame for a nicer display
performance_df2 = pd.DataFrame(performance_metrics).transpose()

# Define the styling function
def highlight_best(data, color='limegreen'):
    attr = f'background-color: {color}'
    output = pd.DataFrame('', index=data.index, columns=data.columns)
    for col in data.columns:
        if 'R^2' in col:
            output[col] = np.where(data[col] == data[col].max(), attr, '')
        else:
            output[col] = np.where(data[col] == data[col].min(), attr, '')
    return output

# Applying the adjusted styling
styled_df2 = performance_df2.style.apply(highlight_best, axis=None).format("{:.3f}")

# Display the styled DataFrame
styled_df2
Out[ ]:
  Testing MAE Testing MSE Testing RMSE Testing R^2
Tuned Random Forest Regressor 1.447 7.297 2.701 0.964
Random Forest Regressor 1.480 7.727 2.780 0.962
SVM Regressor 3.711 25.418 5.042 0.873
Linear Regression 4.545 38.050 6.168 0.811
Gradient Boosting Regressor 2.088 9.295 3.049 0.954
AdaBoost Regressor 3.826 21.444 4.631 0.893
SLNN Regressor 3.101 18.407 4.290 0.908
MLNN Regressor 2.829 15.956 3.994 0.921

The addition of four new models—Gradient Boosting Regressor, AdaBoost Regressor, Single-Layer Neural Network (SLNN) Regressor, and Multi-Layer Neural Network (MLNN) Regressor—alongside the traditional and tuned Random Forest models, provides a comprehensive comparison of commonly used regression models in the context of agricultural price prediction. The aim was to explore a range of models to find the best performer based on standard metrics: Testing Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R-squared (R^2).

Short Interpretation:¶

  • Tuned Random Forest Regressor emerges as the top performer with the highest R^2 score (0.964) and the lowest error metrics (MAE: 1.447, MSE: 7.297, RMSE: 2.701), indicating its superior ability to predict agricultural prices accurately. The tuning process refined its parameters to optimize performance.

  • Gradient Boosting Regressor shows impressive results, with a high R^2 score (0.954) and relatively low error metrics (MAE: 2.088, MSE: 9.295, RMSE: 3.049), making it a strong candidate for this task. Its performance underscores the effectiveness of ensemble learning methods that combine multiple weak predictors to form a robust model.

  • AdaBoost Regressor and SLNN Regressor offer moderate performance with R^2 scores of 0.893 and 0.908, respectively. While not matching the top performers, their results are respectable and highlight the diversity in model capabilities. AdaBoost, another ensemble method, and SLNN, a simple neural network, demonstrate their potential utility in regression tasks with complex datasets.

  • MLNN Regressor achieves a higher R^2 score (0.921) than SLNN, suggesting that adding more layers can capture more complex patterns in the data, though at the cost of increased computational complexity.

  • SVM Regressor and Linear Regression lag behind the other models, with R^2 scores of 0.873 and 0.811, respectively. Their lower performance highlights the challenges these models face with the complex nonlinear relationships and high dimensionality typical of agricultural price data.

Conclusion:¶

The exploration of these models was driven by the need to benchmark the performance of various regression techniques against the agricultural price prediction task. It revealed that ensemble methods and neural networks, especially when fine-tuned, tend to offer superior performance over traditional linear models and SVM in handling the complexities of agricultural price data. The Tuned Random Forest Regressor, in particular, stands out for its exceptional accuracy, underscoring the value of model tuning in optimizing machine learning solutions for real-world applications. This comprehensive model comparison helps in identifying the most effective techniques for predicting agricultural prices, aligning with the research objective to leverage advanced machine learning for enhancing decision-making in the agricultural sector.

Our Research Questions:¶

The results from testing various machine learning models on agricultural price data provide insightful answers to your research questions. Let's delve into each question based on the findings:

1. Harnessing Machine Learning Techniques for Enhanced Price Prediction Accuracy¶

The exploration of multiple machine learning models, including Random Forest, Gradient Boosting, AdaBoost, SLNN, and MLNN, alongside traditional models like SVM and Linear Regression, has demonstrated that machine learning can significantly enhance the accuracy of agricultural price predictions. The Tuned Random Forest Regressor, in particular, showcased the highest accuracy, as evidenced by its superior R^2 score and lower error metrics.

  • Key Insight: The use of ensemble learning methods and neural networks, especially with parameter tuning, proves to be highly effective in capturing the complex, non-linear relationships inherent in agricultural price data. These techniques leverage a multitude of decision trees (Random Forest, Gradient Boosting) or learning algorithms (AdaBoost) to reduce variance and bias, improving prediction accuracy.

2. Influence of Socioeconomic Variables on Agricultural Prices¶

While the initial model comparisons primarily focused on the performance of different machine learning algorithms, incorporating socioeconomic variables into these models is crucial for capturing the broader context that influences agricultural prices. Variables such as economic policy uncertainty, fuel prices, and exchange rates can significantly impact agricultural costs and, subsequently, prices.

  • Strategic Incorporation: To include these indicators in prediction models, one can integrate them as features within the dataset. Machine learning models, particularly those that have performed well like the Tuned Random Forest and Gradient Boosting, can then process these features to understand and predict how changes in socioeconomic factors affect agricultural prices. This approach enhances model robustness by acknowledging external economic factors that directly or indirectly influence market prices.

3. Impact of Interconnections Between Different Agricultural Commodities¶

The relationships and interconnections between different agricultural commodities can have profound impacts on price projections. For example, the price of one commodity might influence another due to substitution effects or shared supply chain dependencies.

  • Modeling Approach: To account for these interconnections, one could employ multivariate analysis techniques within machine learning models to capture the relationships between commodities. Techniques like feature engineering can be used to create new variables that explicitly represent these relationships, such as price ratios between commodities or indexes that measure overall market trends. Moreover, advanced machine learning models capable of handling multivariate time series data, such as Vector Autoregression (VAR) models or recurrent neural networks (RNNs), could be explored for capturing dynamic interdependencies more effectively.

Conclusion¶

The findings from testing various machine learning models provide a solid foundation for addressing the outlined research questions. They underline the potential of machine learning to enhance the accuracy of agricultural price predictions by effectively capturing complex patterns and relationships within the data. Incorporating socioeconomic variables and understanding the interconnections between different agricultural commodities are essential steps toward developing comprehensive and predictive models. These insights align with the goal of leveraging advanced machine learning techniques to aid decision-making processes in agriculture, ensuring economic security and sustainability.

In [ ]: